[Fix-17817][Master]Add workflow timeout event and handle#18063
[Fix-17817][Master]Add workflow timeout event and handle#18063njnu-seafish wants to merge 13 commits intoapache:devfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses issue #17817 by introducing a workflow-level timeout lifecycle event/handler so that configured workflow timeout alerts are actually emitted by the master engine.
Changes:
- Add a new
WorkflowTimeoutLifecycleEvent+WorkflowTimeoutLifecycleEventHandler, and wire publishing fromWorkflowStartLifecycleEventHandler. - Add
TIMEOUTtoWorkflowLifecycleEventTypeto support routing the new event. - Add service/DAO plumbing for sending workflow-timeout alerts and refactor alert persistence helper naming; tighten task-timeout validation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/alert/WorkflowAlertManager.java | Adds sendWorkflowTimeoutAlert entrypoint to reach DAO alert creation. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java | New handler to send workflow timeout alerts when the timeout event fires. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowStartLifecycleEventHandler.java | Publishes workflow timeout event at workflow start (if timeout configured). |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/event/WorkflowTimeoutLifecycleEvent.java | New delay event representing workflow timeout. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/WorkflowLifecycleEventType.java | Adds TIMEOUT event type. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/lifecycle/event/TaskTimeoutLifecycleEvent.java | Tightens validation to require timeout minutes > 0. |
| dolphinscheduler-dao/src/main/java/org/apache/dolphinscheduler/dao/AlertDao.java | Adds validation for workflow timeout alert inputs and consolidates timeout-alert saving helper. |
| dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/enums/AlertType.java | Updates comment to reflect actual enum values. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| long delayTime = System.currentTimeMillis() - workflowInstance.getStartTime().getTime() | ||
| + TimeUnit.MINUTES.toMillis(timeout); |
|
|
||
| public static WorkflowTimeoutLifecycleEvent of(IWorkflowExecutionRunnable workflowExecutionRunnable) { | ||
| final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance(); | ||
| checkState(workflowInstance != null, "The workflow instance must be initialized before retrying."); |
| private void workflowTimeoutMonitor(final IWorkflowExecutionRunnable workflowExecutionRunnable) { | ||
| final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance(); | ||
| if (workflowInstance.getTimeout() <= 0) { | ||
| log.debug("The workflow {} timeout {} is invalided, so the timeout monitor will not be started.", |
| final int timeout = workflowInstance.getTimeout(); | ||
| checkState(timeout > 0, "The workflow timeout: %s must > 0 minutes", timeout); | ||
|
|
||
| long delayTime = System.currentTimeMillis() - workflowInstance.getStartTime().getTime() |
There was a problem hiding this comment.
When the workflow instance is rerun the startTime will not change. You should use restartTime .
There was a problem hiding this comment.
When the workflow instance is rerun the startTime will not change. You should use
restartTime.
good
| private void workflowTimeoutMonitor(final IWorkflowExecutionRunnable workflowExecutionRunnable) { | ||
| final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance(); | ||
| if (workflowInstance.getTimeout() <= 0) { | ||
| log.debug("The workflow {} timeout {} is invalided, so the timeout monitor will not be started.", |
There was a problem hiding this comment.
The log is incorrect.
fix it
| * Pause the workflow instance | ||
| */ | ||
| PAUSE, | ||
|
|
There was a problem hiding this comment.
Please don't change the unrelated code.
There was a problem hiding this comment.
Please don't change the unrelated code.
ok
added |
| postTaskCode: 1 | ||
| postTaskVersion: 1 | ||
| createTime: 2024-08-12 00:00:00 | ||
| updateTime: 2024-08-12 00:00:00 No newline at end of file |
There was a problem hiding this comment.
| updateTime: 2024-08-12 00:00:00 | |
| updateTime: 2024-08-12 00:00:00 | |
We should leave an blank line at the end of each file.
There was a problem hiding this comment.
We should leave an blank line at the end of each file.
ok
There was a problem hiding this comment.
Pull request overview
Adds workflow-timeout alerting to the Master workflow lifecycle by introducing a timeout lifecycle event + handler, wiring it into workflow start, and validating via a new master integration test.
Changes:
- Introduce
WorkflowTimeoutLifecycleEvent/WorkflowTimeoutLifecycleEventHandlerand addTIMEOUTto workflow lifecycle event types. - Publish workflow-timeout events on workflow start and add alert sending for workflow timeouts via
WorkflowAlertManager→AlertDao. - Add a master integration test + YAML fixture to assert a workflow-timeout alert is persisted.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/alert/WorkflowAlertManager.java | Adds sendWorkflowTimeoutAlert entrypoint delegating to AlertDao. |
| dolphinscheduler-master/src/test/resources/it/start/workflow_with_workflow_timeout_alert.yaml | New IT fixture defining a workflow with timeout + warningGroupId. |
| dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/cases/WorkflowStartTestCase.java | New IT case asserting workflow-timeout alert is created. |
| dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/WorkflowOperator.java | Passes warningGroupId into manual trigger request; extends DTO. |
| dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/Repository.java | Adds query helper for alerts by workflow instance id. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java | New handler that sends workflow-timeout alerts. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowStartLifecycleEventHandler.java | Publishes workflow-timeout event when workflow starts and timeout is configured. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/event/WorkflowTimeoutLifecycleEvent.java | New delayed lifecycle event representing a workflow timeout. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/WorkflowLifecycleEventType.java | Adds TIMEOUT workflow lifecycle event type. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/lifecycle/event/TaskTimeoutLifecycleEvent.java | Tightens timeout precondition to > 0. |
| dolphinscheduler-dao/src/main/java/org/apache/dolphinscheduler/dao/AlertDao.java | Validates inputs and unifies timeout alert persistence helper method. |
| dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/enums/AlertType.java | Updates enum comment for alert type codes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| long delayTime = System.currentTimeMillis() - workflowInstance.getRestartTime().getTime() | ||
| + TimeUnit.MINUTES.toMillis(timeout); |
There was a problem hiding this comment.
WorkflowTimeoutLifecycleEvent.of assumes workflowInstance.getRestartTime() is non-null and uses it to compute delayTime. In practice restartTime can be null (there are call sites that explicitly fall back to startTime when it's missing), which would cause an NPE and prevent timeout handling. Additionally, the current delay calculation adds the elapsed time since restartTime to the configured timeout, which schedules the event later than the configured timeout if there is any delay between setting restartTime and publishing the event. Consider falling back to startTime when restartTime is null and computing the remaining delay until (restart/start time + timeout) (clamping to 0 if already exceeded).
| long delayTime = System.currentTimeMillis() - workflowInstance.getRestartTime().getTime() | |
| + TimeUnit.MINUTES.toMillis(timeout); | |
| final long baseTimeMillis = | |
| workflowInstance.getRestartTime() != null | |
| ? workflowInstance.getRestartTime().getTime() | |
| : (workflowInstance.getStartTime() != null | |
| ? workflowInstance.getStartTime().getTime() | |
| : 0L); | |
| checkState(baseTimeMillis > 0, | |
| "The workflow instance startTime and restartTime must not both be null when creating workflow timeout event."); | |
| final long timeoutMillis = TimeUnit.MINUTES.toMillis(timeout); | |
| final long deadlineMillis = baseTimeMillis + timeoutMillis; | |
| final long delayTime = Math.max(0L, deadlineMillis - System.currentTimeMillis()); |
There was a problem hiding this comment.
no need, restartTime is not null
| .dryRun(workflowTriggerDTO.getDryRun()) | ||
| .taskDependType(workflowTriggerDTO.getTaskDependType()) | ||
| .failureStrategy(workflowTriggerDTO.getFailureStrategy()) | ||
| .warningGroupId(workflowTriggerDTO.getWarningGroupId()) |
There was a problem hiding this comment.
WorkflowTriggerDTO now contains warningType, but manualTriggerWorkflow doesn't propagate it into WorkflowManualTriggerRequest (only warningGroupId is set). This makes the DTO field ineffective and can lead to confusion when writing tests that intend to control warningType. Either remove warningType from the DTO or pass it through to the request builder.
| .warningGroupId(workflowTriggerDTO.getWarningGroupId()) | |
| .warningGroupId(workflowTriggerDTO.getWarningGroupId()) | |
| .warningType(workflowTriggerDTO.getWarningType()) |
|
|
@ruanwenjun @SbloodyS Whenever you have time, I’d be grateful for your review. Thanks so much! |


Was this PR generated or assisted by AI?
NO
Purpose of the pull request
close #17817
This bug involves a critical feature and has been stagnant for several months. Since I actually resolved the task timeout and workflow timeout issue quite some time ago, I would like to finalize the process and get it closed.

Brief change log
Add workflow timeout event and handle
Verify this pull request
This pull request is code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(or)
Pull Request Notice
Pull Request Notice
If your pull request contains incompatible change, you should also add it to
docs/docs/en/guide/upgrade/incompatible.md